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Demographic models built from genetic data play important roles in illuminating prehistorical 
events and serving as null models in genome scans for selection. We introduce an inference method 
based on the joint frequency spectrum of genetic variants within and between populations. For 
candidate models we numerically compute the expected spectrum using a diffusion approximation 
to the one-locus two-allele Wright-Fisher process, involving up to three simultaneous populations. 
Our approach is a composite likelihood scheme, since linkage between neutral loci alters the variance 
but not the expectation of the frequency spectrum. We thus use bootstraps incorporating linkage 
to estimate uncertainties for parameters and significance values for hypothesis tests. Our method 
can also incorporate selection on single sites, predicting the joint distribution of selected alleles 
among populations experiencing a bevy of evolutionary forces, including expansions, contractions, 
migrations, and admixture. 

We model human expansion out of Africa and the settlement of the New World, using 5 Mb of 
noncoding DNA resequenced in 68 individuals from 4 populations (YRI, CHB, CEU, and MXL) 
by the Environmental Genome Project. We infer divergence between West African and Eurasian 
populations 140 thousand years ago (95% confidence interval: 40 - 270 kya). This is earlier than 
other genetic studies, in part because we incorporate migration. We estimate the European (CEU) 
and East Asian (CHB) divergence time to be 23 kya (95% c.i.: 17 - 43 kya), long after archeological 
evidence places modern humans in Europe. Finally, we estimate divergence between East Asians 
(CHB) and Mexican-Americans (MXL) of 22 kya (95% c.i.: 16.3 - 26.9 kya), and our analysis 
yields no evidence for subsequent migration. Furthermore, combining our demographic model with 
a previously estimated distribution of selective effects among newly arising amino acid mutations 
accurately predicts the frequency spectrum of nonsynonymous variants across three continental 
populations (YRI, CHB, CEU). 

Abbreviations: AFS, allele frequency spectrum; CEU, Utah residents of Northern and Western 
European ancestry; CHB, Han Chinese from Beijing, China; EGP, Environmental Genome Project; 
kya, thousands of years ago; LD, linkage disequilibrium; MXL, Los Angeles residents of Mexican 
ancestry; SNP, single nucleotide polymorphism; YRI, Yoruba from Ibadan, Nigeria 



I. INTRODUCTION 

Demographic models inferred from genetic data 
play several important roles in population genet- 
ics. First, they complement archeological evidence 
in understanding prehistorical events (such as the 
number and timing of major continental migra- 
tions) which have left no written record [Il[2]- Sec- 
ond, they facilitate the search for genetic regions 
that have been targets of non-neutral forces, such 
as recent natural selection, by guiding our expecta- 
tions as to how much sequence and haplotype vari- 
ation one expects to see in a given genomic region 
(and, more importantly, the variance around these 



expectations) [3 . Finally, existing demographic 
models can guide sampling design for subsequent 
population or medical genetic studies. 

Given their many uses, it is not surprising that 
many studies have inferred demographic models for 
populations of humans and other species [H [51 [SI 

EllHllSllIlinillllllllllli]. 

The process of inferring a demographic model 
consistent with a particular data set typically 
involves exploring a large parameter space by 
simulating the model many times, often using 
coalescent-theory based Monte Carlo approaches. 
For computational reasons, many of the demo- 
graphic inference procedures developed thus far 
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have focused on single population models or mod- 
els with multiple populations but no subsequent 
migration after subpopulations spht (i.e., [H [5l 
H H E], but also see [M [TH]). Methods 
that do consider multiple populations with migra- 
tion often assume independent non-recombining 
regions [II [19] and do not often scale to genomic 
size data sets. Approaches for jointly considering 
recombination and migration often use a restricted 
set of summary statistics [3] of the data, which 
limits their statistical power. Finally, complex de- 
mographic inferences that make use of many sum- 
mary statistics are often very computationally in- 
tensive [8j [101 [H] J which precludes thorough inves- 
tigation of their statistical properties. 

Here, we develop and apply a computation- 
ally efficient diffusion-based approach to the prob- 
lem of demographic inference, based on the multi- 
population allele frequency spectrum (AFS) (i.e., 
the joint distribution of allele frequencies across 
SNPs) [ini [m [ISI [Ml [H]- Given a genetic region 
sequenced in multiple individuals from each of P 
populations, the resulting AFS is a P-dimensional 
matrix. Each entry of this matrix records the num- 
ber of diallelic genetic polymorphisms in which 
the derived allele was found in the corresponding 
number of samples from each population. For ex- 
ample, if diploid individuals from two populations 
were sequenced, with 10 individuals from popula- 
tion 1 and 5 from population 2, the AFS would 
be a 21-by-ll matrix (indexed from 0). The [2,0] 
entry would record the number of polymorphisms 
for which the derived allele was seen twice in pop- 
ulation 1 but never seen in population 2, while 
the [20,5] entry would record polymorphisms for 
which the derived allele was homozygous in all in- 
dividuals from population 1 and seen 5 times in 
population 2. If all polymorphic sites possess only 
two alleles and can be considered independent, the 
AFS is a complete summary of the data. Many 
of the statistics commonly used for population ge- 
netic inference, such as Fst and Tajima's D, are 
summaries of the AFS (see [rSll^ l. 

Efficient techniques exist for simulating the AFS 
of a single population [11[51[3S]. The joint AFS be- 
tween two populations has been used by several 
recent studies [101 [H [H [21], but these have all 
relied upon very computationally intensive coales- 
cent simulations. Here we approximate the joint 
multi-population AFS by numerical solution of a 
diffusion equation, and our implementation sup- 
ports up to three simultaneous populations. Be- 



cause the diffusion approach neglects linkage, our 
comparison with the data is through a composite 
likelihood function. Such likelihoods are consistent 
estimators under a wide range of population ge- 
netic scenarios for selectively-neutral data, but do 
not correctly capture variances |25j . (Lower recom- 
bination induces higher linkage and higher vari- 
ance among the expected entries of the AFS.) As 
we demonstrate below, the efficiency of our diffu- 
sion approach enables both conventional and para- 
metric bootstrap resampling of the data, allowing 
us to accurately estimate confidence intervals for 
parameter values and critical values for hypothe- 
sis tests [26], accounting for any degree of Hnkage 
found in the data. This bootstrap procedure over- 
comes the traditional concerns with composite like- 
lihood as a philosophy for inference in population 
genetics 

To demonstrate the utility of our approach, we 
apply our method to two epochs in human his- 
tory, using single nucleotide polymorphism (SNP) 
data from the Environmental Genome Project 
(EGP) [27], the largest public database of human 
resequencing data. We first study the expansion of 
humans out of Africa, jointly modeling the history 
of African, European, and East Asian populations. 
We then study the settlement of the New World, 
jointly modeling European, East Asian, and ad- 
mixed Mexican populations. In both cases, we 
quantify the uncertainty of our parameter infer- 
ences and test hypotheses about migration (boot- 
strapping to account for linkage). In particular, 
we infer an earlier divergence between African and 
Eurasian populations than previous studies, be- 
cause our inferences account for the substantial 
migration between these populations. Our meth- 
ods also find no evidence for multiple migrations 
between East Asia and the New World. While sim- 
ilarly complex models for human continental pop- 
ulations have been studied [5], to our knowledge, 
our analysis is the first in which the full joint AFS 
is used for inference and in which uncertainty and 
goodness-of-fit have been quantified. 

An important advantage of the diffusion ap- 
proach is the ease with which selection can be 
incorporated. As an illustrative application, we 
also predict the distribution of protein-coding vari- 
ation between populations. In agreement with the 
data, we find that less nonsynonymous variation 
is shared between populations than might be ex- 
pected based only on patterns of shared noncoding 
variation. 
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FIG. 1. Frequency spectrum gallery. A) Qualitative effects of modeled neutral genetic forces on 
(j){xi,X2,t), the density of alleles at relative frequencies Xi and X2 in populations 1 and 2. B) For the 
spectra shown, an equilibrium population of effective size Na diverges into two populations 2Nat 
generations ago. Populations 1 and 2 have effective sizes viNa and V2Na, respectively. Migration is 
symmetric at m — M/{2Na) per generation, and 9 = 1000. C) The AFS at r = 0. Each entry is colored 
by the logarithm of the number of sites in it, according to the scale shown. D) The AFS at various 
times for various demographic parameters, on the same scale as B. E) Comparison between coalescent- 
and diffusion-based estimates of the likelihood £ of data generated under the model A. 
Coalescent-based estimates of the likelihood, each of which took approximately 7.0 seconds, are 
represented in the histogram. The result from our diffusion approach, which took 2.0 seconds, is 
represented by the red line. For accuracy comparison, the yellow line indicates the likelihood inferred 
from 10® coalescent simulations. 



While no model can capture the full complex- 
ity of any species' genetic history, the models pre- 
sented refine our understanding of the expansion of 
humanity across the globe. None of the method- 



ology is specific to humans, and we expect our 
method will find wide application to demographic 
inference of other species. 



II. METHODS 



A. Diffusion approximation 



To efficiently simulate the AFS, we adopt a diffusion approach. Such approaches have a long and distin- 
guished history in population genetics, dating back to R. A. Fischer [28l [29l |30] . The diffusion approach 
is a continuous approximation to the population genetics of a discrete number of individuals evolving 
in discrete generations. An important underlying assumption is that per-generation changes in allele 
frequency are small. Consequently, the diffusion approximation applies when the effective population 
size N is large and migration rates and selection coefficients are of order 1/A^. 

If we have samples from P populations, the numbers of sampled sequences from each population are 
ni,n2j • • • ,np. (For diploids, ni is twice the number of individuals sampled from population 1.) Entry 
di,d2, ■ ■ ■ ,dp of the AFS records the number of diallelic polymorphic sites at which the derived allele was 
found in c?i samples from population 1, d2 from population 2, and so forth. (If ancestral alleles cannot 
be determined, then the "folded" AFS can be considered, in which entries correspond to the frequency 
of the minor allele.) 

We model the evolution of 4'{xi, X2, . ■ ■ ,xp,t), the density of derived mutations at relative frequencies 
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xi,X2, ■ ■ ■ ,xp in populations 1,2, . . . , P at time t. (All x run from to 1.) Given an infinitely-many- 
sites mutational model |31j and Wright-Fisher reproduction in each generation, the dynamics of (f) for an 
arbitrary finite number of populations P are governed by a linear diffusion equation: 

i=l,2,...,P i=l,2,...,P ^ j = l,2,...,P ' 

The first term models genetic drift, and the second term models selection and migration. Fig[TJ\ illustrates 
the effects of different evolutionary forces on components of Time is in units of r = t/{2Nref)i where 
t is the time in generations and A'^e/ is a reference effective population size. The relative effective size 
of population 1 is i^i = Ni/Nref- The scaled migration rate is Mi^2 = 2iVre/™i«-2, where mi^2 is 
the proportion of chromosomes per generation in population 1 that are new migrants from population 
2. (Thus migration is assumed to be conservative [32]). Finally, the scaled selection coefficient is 71 = 
2Nj.f,fSi, where Si is the relative selective advantage or disadvantage of variants in population 1. Boundary 
conditions are no-ffux except at two corners of the domain, where all population frequencies are or 1; 
these are absorbing points corresponding to allele loss or fixation. Because the diffusion equation is linear, 
we can solve simultaneously for the evolution of all polymorphism by continually injecting (j) density at 
low frequency in each population (at a rate proportional to the total mutation flux 6), corresponding to 
novel mutations. 

Changes in population size and migration alter the parameters in Eqn [l] while population splits and 
mergers alter the dimensionality of (j). For example, if new population 3 is admixed with a proportion / 
from population 1 and 1 — / from population 2 then 

(j){xi,X2,X3,t) = (t){xi,X2,t) 5[x^ - [fxi + (1 - .f)x2]), (2) 

where 5 denotes the Dirac delta function. To remove population 2, <j) is integrated over X2'- 4>{xi,x^,t) = 

4){xi,X2,Xz,t) dX2. 

Given </), the expected value of each entry of the AFS, M[di, dj, . . . , dp], is found via a P-dimensional 
integral over all possible population allele frequencies of the probability of sampling di,dj . . . ,dp derived 
alleles times the density of sites with those population allele frequencies. For SNP data obtained by 
resequencing, these probabilities are binomial, so 

M[d,,d,,...,dp]^ j ■■■ f H h)xf^{l-x,r-''^^{xi,X2,...,xp)dx,. (3) 

In some cases of ascertained data |33j, the resulting bias can be corrected by modifying the above 
equation pTl IM] . 



B. Likelihood- based inference 



Let Q correspond to the parameters of a demographic model we wish to estimate from the observed 
multi-population allele frequency spectrum, which we denote S[di,dj, . . .]. Assuming no linkage between 
polymorphisms, each entry in the AFS is an independent Poisson variable |2Q| , with mean M[di, dj, . . .] 
(which depends on Q). We can, therefore, construct a likelihood function £{Q \ S) using standard 
statistical theory: 

^(ei,)= n n ^"""'^'•••'t;!f'^'^--,t^""'^'-'"' . (4) 

»=o...Pd.=o...n. S[d^,dj,...,dp\. 



In words, our approach consists of calculating the expected allele frequency spectrum AI using a par- 
ticular demographic model (and set of parameter values for that demographic model) using our diffusion 
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approach. We then maximize the similarity between M and the observed AFS, S, over the parameter 
values that Q can take on. Competing demographic models can be chosen from using standard statistical 
theory such as the nested likelihood ratio test or Information Criteria such as the Akaike or Bayesian 
Information Criteria. 

For linked polymorphisms, £ is a composite likelihood. Such likelihoods are consistent estimators 
under a wide range of neutral population genetic scenarios |25j , but simulations incorporating linkage are 
necessary to estimate variances and define critical values for hypothesis testing and model selection. In 
our applications, we estimate variances using simulations from the coalescent simulator ms |35j . 

C. Numerics 

Solving the multi-population diffusion equation is substantially more demanding than the single- 
population case (23j . This is primarily because the boundary conditions are more complex, and the 
numerical grid of population frequencies x must be much coarser to be computationally tractable, be- 
cause it is of P dimensions. For example, a previous single-population study [23| used a uniform x grid 
of order 10^ values between and 1. Extending this grid to a three-population simulation would require 
an infeasible array of size 10^^. Instead, we use a nonuniform grid and extrapolation to enable accurate 
computation using of order 100 values along each dimension, for a final array size of order 10^. 

We solve the diffusion equation on a regular nonuniform grid, using a finite difference scheme [36j 
inspired by the method of Chang and Cooper [ST] (supporting information). Mutations in population 
1 arise at frequency l/(2A^i) = 1/ {2NrefVi)- The diffusion approximation applies when N^ef — > oo, 
but the minimum frequency in our numerical simulation is that of the first grid point, denoted A. To 
overcome this, we extrapolate our results to an infinitely fine grid. We use a quadratic extrapolation on 
the logarithm of the AFS entry, modeling the bias introduced by the finite initial grid point A as 

log Mcaic(A) = log Moo + aA + fcA^. (5) 

Here Mcaic(A) is an AFS element calculated at grid size A and Moo is the extrapolated value. Given three 
evaluations at different grid sizes A, we solve for M^o and use this value when calculating likelihoods. This 
vastly increases both the speed and accuracy of our calculation (supporting information) . While higher- 
order extrapolations may improve accuracy in some cases, they may also be more sensitive to numerical 
noise. Our empirical experience is that a quadratic approximation provides a good compromise between 
accuracy, efliiciency, and robustness. 

The computational cost for a single likelihood evaluation scales as G^^^ where G is the number of 
grid points used. In our experience, for stability and accuracy G should somewhat larger than the largest 
population sample size. Although our theoretical framework extends to an arbitrary number of popula- 
tions, the exponential scaling of computation with P limits our current applications to three simultaneous 
populations. Importantly, our likelihood calculation is deterministic and numerically smooth, so numer- 
ical derivatives can be used in optimization. We use the the quasi-Newton BFGS method |35], which 
converges in order iVp steps, where Np is the number of free parameters. 

Our implementation of these methods, d&d'i, is written in cross-platform Python and C, making use of 
the NumPy [3S], Scipy [33]) and Matplotlib libraries [ID]. It is distributed under the open-source BSD 
license. All calculations herein were performed with da,d\ version 1.1.0. 

We estimated parameter uncertainties by both conventional bootstrap (fitting data sets resampled 
over loci) and parametric bootstrap (fitting simulated data sets). To generate simulated data we used 
the coalescent program ms |35j . a region-specific recombination rate, and the detailed EGP sequencing 
strategy (supporting information). 

The confidence intervals reported in Tables |l] and |ll] derive from a normal approximation to the boot- 
strap results. For the conventional bootstrap, confidence intervals were calculated as: 9* ± 1.96 (t(0*). 
For the parametric bootstrap, biased-corrected intervals were calculated as: 9 — {9* — 9) zL 1.96 cr{9*). The 
maximum-likelihood value is denoted 9, while 9* and (7{9*) denote the mean and standard deviation of 
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the bootstrap results. Aside from the growth rates r, all our model parameters are positive by definition, 
so in those cases we used their logarithms when calculating confidence intervals. 

Pearson's goodness-of-fit test was performed using all 21'^ — 2 = 9259 bins in the AFS. Results are 
similar if we restrict our analysis to entries in which the expected value is > 1 or > 5. 

D. Data 

We used the National Institute of Environmental Health Science's Environmental Genome Project 
SNPs database [H], which results from direct Sanger resequencing of environmental response genes in 
several populations. We considered all diallelic SNPs in 5.01 Mb of sequence from noncoding regions of 
219 autosomal genes (supporting information). These data have been the subject of many publications, 
including [T71 [23J [571 SI] • As an assessment of quality, additional high-coverage short-read sequencing 
has recently been performed across 8 samples in this data set. Over 26,000 sites, the SNP concordance 
between this next-generation sequencing and the original Sanger sequencing averages 99.5% (D. Nickerson, 
personal communication). Given the high quality of this data set, we do not incorporate sequencing error 
into our modeling. We believe such correction will be essential in future applications to less accurate 
short-read sequencing data, as inference based on the frequency spectrum is sensitive to rare alleles. 

To estimate the ancestral allele, we aligned to the panTro2 build of the chimp genome 143]. Like 
other methods based on the unfolded AFS, our analysis is sensitive to errors in identifying the ancestral 
allele. We statistically corrected the AFS for ancestral misidentification [T7], using a context-dependent 
substitution model [44]. This procedure has been shown to perform better than aligning to multiple 
species [17j . To account for missing data and ease qualitative comparisons between populations, we 
projected all spectra down to 20 samples per population |S] (supporting information). 

The human-chimp divergence in the data is 1.13%. We assumed a divergence time of 6 My jlS] and 
a generation time of 25 years. This yielded an estimated neutral mutation rate of /i = 2.35 x 10~® per 
site per generation, which is comparable to direct estimates [IB]. There is some controversy as to the 
appropriate generation time to assume in human population genetic studies |171[1H|- In particular, the 
human generation time may differ between cultures and may have changed during our biological and 
cultural evolution. The bootstrap uncertainties reported in Tables |l] and |ll] do not include systematic 
uncertainties in the human-chimp divergence or generation times. The generation time, however, formally 
cancels when converting between genetic and chronological times. 

E. Nonsynonymous polymorphism 

In our prediction of the distribution of nonsynonymous polymorphism, the distribution of selective ef- 
fects assumed was a negative-gamma distribution with shape parameter a = 0. 184 and scale /? = 8200 [49] . 
The AFS was calculated by trapezoid-rule integration over this distribution, using 201 evaluations loga- 
rithmically spaced over 7 — [—300, —10^^]. All demographic parameters, including the scaled mutation 
rate 9, were set to the maximum-likelihood values from our Out of Africa analysis. 



III. RESULTS 

First, we explored how various demographic 
forces affect the AFS, building intuition for our 
subsequent applications to real data. We then 
compared the performance of diffusion versus co- 



alescent methods for evaluating the AFS, finding 
that the diffusion approach is substantially faster. 
We then applied our diffusion approach to infer 
parameters for plausible demographic models for 
the history of continental human populations. We 
first considered the expansion of humans out of 
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Africa and then the settlement of the New World. 
In these applications, we inferred the maximum 
composite-likelihood parameters of our models us- 
ing diffusion fits to the real data. To account for 
linkage in estimating variances and critical values 
for hypothesis tests, we then repeatedly fit both 
conventional and parametric bootstrap data sets. 
Finally, in an application incorporating selection, 
we predicted the distribution of nonsynonymous 
variation between populations in our Out of Africa 
model, finding good agreement with the available 
data. 



A. Demographic effects on the AFS 

In Fig[l] we provide examples of the AFS under 
different demographic scenarios. Fig[Tj3 illustrates 
the isolation-with-migration model for which the 
spectra are calculated. The expected spectrum at 
zero divergence time is shown in Fig [Tp. Fig [Tp 
shows the expected spectrum at various divergence 
times under various demographic scenarios. Qual- 
itatively, correlation between population allele fre- 
quencies declines with increasing divergence time, 
depopulating the diagonal of the AFS. On the 
other hand, migration prolongs and sustains cor- 
relation. Less obviously, AFS entries correspond- 
ing to shared low-frequency alleles distinguish be- 
tween increased migration and reduced divergence 
time (supporting information). Additionally, dif- 
ferences in genetic drift between populations with 
different effective sizes result in asymmetries in the 
AFS. These qualitative features of the AFS are also 
evident in human data; detailed modeling allows us 
to quantify our inference regarding the type, tim- 
ing, and strength of demographic events that are 
consistent with the data. 



B. Computational performance 

The computer program implementing our 
method is named dadi (Diffusion Approximations 
for Demographic Inference). It is open-source and 
freely available at http : //dadi . googlecode . com. 

Fig [Tp compares da,di with a coalescent ap- 
proach to evaluating the likelihood of frequency 
spectrum data. The coalescent simulator ms f35] 
was used to generate a simulated data set from 
the model in Fig [l]3, with parameters i^i = 0.9, 



V2 = 0.1, M = 2, r = 2, 6* = 1000, scaled total 
recombination rate p — 1000, and 20 samples per 
population. Coalescent-based estimates of the ex- 
pected AFS were generated by averaging 10^ ms 
simulations, each run with 9=1 and p = 0. These 
estimates were scaled to 6 = 1000 for comparison 
with the simulated data set. (This procedure is 
substantially faster than simulating with larger 6 
and p.) Each estimate took approximately 7.2 sec- 
onds of computation. The histogram in Fig [Tp 
shows the resulting distribution of estimated like- 
lihoods of the data. Shown by the red line in 
Fig [Tp is the result from our diffusion approach 
(with grid sizes G = {40,50,60}), which took ap- 
proximately 2.0 seconds of computation. The yel- 
low line is the likelihood from 10^ coalescent simu- 
lations, illustrating the high accuracy of our diffu- 
sion approach. (Note that the coalescent approach 
we consider here is not necessarily optimal. We 
are, however, unaware of any such approach that 
is competitive in computational speed with the dif- 
fusion method.) 

The computational advantage of the diffusion 
method is even larger when placed in the con- 
text of parameter optimization. Unlike the coa- 
lescent approach, there is no simulation variance, 
so efficient derivative-based optimization methods 
can be used. As examples, consider our applica- 
tions to human data, which involve 20 samples 
per population. On a modern workstation, fitting 
a single-population three-parameter model took 
roughly a minute, while fitting a two-population 
six-parameter model took roughly 10 minutes. The 
fits of three-population models with roughly a 
dozen parameters typically took a few hours to 
converge from a reasonable initial parameter set. 
This speed allows us to use extensive bootstrap- 
ping to estimate variances, overcoming the limita- 
tions of composite likelihood. 

C. Expansion out of Africa 

Our analysis of human expansion out of Africa 
used data from three HapMap populations: 12 
Yoruba individuals from Ibadan, Nigeria (YRI); 22 
CEPH Utah residents with ancestry from northern 
and western Europe (CEU); and 12 Han Chinese 
individuals sampled in Beijing, China (CHB). Be- 
cause approaches based on the frequency spectrum 
are sensitive to miscalling of the ancestral state, we 
statistically corrected for ancestral misidentifica- 
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FIG. 2. Out of Africa analysis. A) AFS for the YRI, CEU and CHB populations. The color scale 
is as in subfigure C. B) Illustration of the model we fit, with the 14 free parameters labeled. C) 
Marginal spectra for each pair of populations. The top row is the data, and the second is the 
maximum-likelihood model. The third row shows the Anscombe residuals between model and data. 
Red or blue residuals indicate that the model predicts too many or too few alleles in a given cell, 
respectively. D) The observed decay of linkage disequilibrium (black lines) is qualitatively well-matched 
by our simulated data sets (colored lines). E) Goodness-of-fit tests based on the likehhood C and 
Pearson's X"^ statistic both indicate that our model is a reasonable, though incomplete description of 
the data. In both plots, the red line results from fitting the real data and the histogram from fits to 
simulated data. Poorer fits lie to the right (lower C and higher X^). F) The improvement in likelihood 
from including contemporary migration in the real data fit (red line) is much greater than expected 
from fits to simulated data generated without contemporary migration (histogram). This indicates that 
the data contain a strong signal of contemporary migration. 
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tion using an approach that accounts for a myriad 
of mutation and context-dependent biases (such as 
CpG effects) [17] . To ease qualitative comparison 
among populations and account for missing data, 
we projected the data down to 20 sampled chro- 
mosomes per population Because this data 
set is of very high quality (>99% concordance of 
sequenced SNPs with next-generation sequencing 
of the same individuals to high coverage; see Ma- 
terials and Methods), we do not explicitly cor- 
rect for sequencing errors here. We were left with 
17,446 segregating diallelic single nucleotide poly- 
morphisms (SNPs) from effectively 4.04 Mb of se- 
quence. Fig[2]A_ shows the resulting AFS. For ease 
of visualization, the top row of Fig |2p shows the 
two-population marginal spectra. 

There are many possible three-population demo- 
graphic models one could consider for these pop- 
ulations. To develop a parsimonious yet realis- 
tic model, we first considered the marginal AFS 
for each population and each pair of populations. 
Previous analyses found that the YRI spectrum is 
well-fit by a two-epoch model with ancient pop- 
ulation growth O [17] , and we found this as well 
(supporting information) . Previous analyses of the 
CEU and CHB populations found that both popu- 
lations went through bottlenecks [51 HT concurrent 
with divergence [11] . Such models qualitatively fit 
the marginal CEU-CHB spectrum (supporting in- 
formation) . 

Combining these demographic features yields 
the model illustrated in Fig [2|3. The maximum 
likelihood values for the 14 free parameters are 
reported in Table |Tj Qualitatively, the resulting 
model reproduces the observed spectra well, as 
seen in the second and third rows of Fig[2p. (The 
correlation between adjacent residuals is due in 
part to our projection of the data down from a 
larger sample size (supporting information).) Al- 
lowing for asymmetric gene flow yielded very little 
improvement in fit, as did allowing for growth in 
the Eurasian ancestral population or allowing the 
CEU and CHB bottleneck and divergence times to 
differ (data not shown) . 

Our composite likelihood function assumes that 
polymorphic sites are independent. Because it 
thus overestimates the number of effective indepen- 
dent data points, confidence intervals calculated 
directly from the composite likelihood function will 
be too liberal. To control for linkage, we performed 
both conventional and parametric bootstraps. Be- 
cause our sequenced genes are typically well sepa- 



rated, they can be treated as independent, and our 
conventional bootstrap resampled from the 219 se- 
quenced loci. For the parametric bootstrap, simu- 
lated data sets that incorporate linkage and the 
EGP's sequencing strategy were generated with 
ms [35 . 

Table [l] reports parameter 95% confidence inter- 
vals from both the conventional and bias-corrected 
parametric bootstraps. The parametric bootstraps 
yield slightly smaller confidence intervals than the 
conventional bootstrap, suggesting that some vari- 
ability in the data has not been accounted for by 
our simulations. This variability may involve small 
varied selective forces on the sequenced regions, 
or slight relatedness between sampled individu- 
als. The parametric bootstrap results additionally 
show that our method possesses very little bias in 
parameter inference (supporting information). 

As seen in Table |T] the times for growth in 
the African ancestral population and divergence of 
the Eurasian ancestral population {Taf and Tb) 
have particularly wide confidence intervals, likely 
a consequence of the high inferred migration rate 
ruAF-B between the African and Eurasian ances- 
tral populations. Taf shows high correlation with 
the ancestral population size Na, while Tg shows 
no strong linear correlation with any other sin- 
gle parameter (supporting information) . We found 
that 92 out of our 100 conventional bootstrap fits 
yield Naso < Neuoi supporting the contention 
that the CHB population suffered a more severe 
bottleneck than the CEU population [TT] . 

We used several metrics to assess our model's 
goodness-of-fit, in additional to visual inspection 
of the residuals seen in Fig[2p. Fig[2fD compares 
the decay of linkage disequilibrium (LD) in the 
data and in the parametric bootstrap simulations. 
The agreement seen is notable because our demo- 
graphic inference used no LD information in build- 
ing and fitting the model. This LD comparison 
thus serves as independent validation of both our 
model and bootstrap simulations We also asked 
whether the likelihood C found in the real data fit 
is atypical of fits to simulated data. Out of fits to 
100 simulated data sets, 2 produced a smaller like- 
lihood (worse fit) than the real data fit (Fig[2|il), 
yielding a p-value of «0.02. One can craft exam- 
ples in which a likelihood-based goodness-of-fit test 
fails to exclude very poor models [51] . Thus we also 
applied Pearson's goodness-of-fit test, a more 
robust and standard method for data that is in 
Poisson-distributed bins, such as the AFS [35]. In 
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TABLE I. Out of Africa inferred parameters 



parameter" 


maximum 
likelihood 


conventional 
bootstrap 95% 
confidence interval 


parametric bootstrap 
bias-corrected 95% 
confidence interval 


Na 


7,300 


4,400 - 10,100 


6,300 - 9,200 


Naf 


12,300 


11,500 - 13,900 


11,100 - 13,100 


Nb 


2,100 


1,400 - 2,900 


1,700 - 2,600 


Neuo 


1,000 


500 - 1,900 


500 - 1,500 


r-EU (%) 


0.40 


0.15 - 0.66 


0.26 - 0.57 


Naso 


510 


310 - 910 


320 - 750 


T-AS (%) 


0.55 


0.23 - 0.88 


0.32 - 0.79 


niAF-B (xlO-^) 


25 


15 - 34 


19 - 36 


mAF-EU (xlO"^) 


3.0 


2.0 - 6.0 


1.6 - 7.6 


mAF-AS (xlO~'^) 


1.9 


0.3 - 10.4 


0.7 - 6.9*" 


msu-AS (xlO"^) 


9.6 


2.3 - 17.4* 


5.7 - 20.2 


Taf (kya) 


220 


100 - 510 


90 - 410 


Tb (kya) 


140 


40 - 270 


60 - 310 


Teu-as (kya) 


21.2 


17.2 - 26.5 


17.6 - 23.9 



"See Fig[2p for model schematic. Growth rates r and migration rates m are per generation. 
''One low-migration outlier was removed for each of these estimations. 



our case, we must use our parametric bootstraps 
to assess the significance of the sum-of-squared- 
residuals test statistic X^, because many entries 
in the AFS are small and because they are not 
strictly independent. Fig[2j5 shows the bootstrap- 
derived empirical distribution of . Two of the 
bootstraps yielded a larger X'^ (worse fit) than the 
real data fit, giving a p-value of R:!0.02, identical 
to that from the likelihood-based test. (The two 
simulations that yield a higher X^ than the real fit 
are not the same two that yield a lower £, suggest- 
ing that these tests are somewhat independent.) 
In some cases specific frequency classes of SNPs, 
such as rare alleles, may be of particular interest. 
In the supporting information, we provide compar- 
isons of the joint distribution of rare alleles seen in 
the data with that from our simulations. These 
comparisons indicate that our model also repro- 
duces well this interesting region of the frequency 
spectrum. Finally, in Fig [4] we compare the model 
and data using larger bins of SNPs specific to spe- 
cific populations or segregating at high or low fre- 
quency. In all cases the model agrees within the 
uncertainty of the bootstrapped data. Taken to- 
gether, these tests suggest that our model provides 
a reasonable, though not complete, explanation of 
the data, lending credence to our demographic es- 



timates. 

The inferred contemporary migration parame- 
ters {mAF-EU , ruAF-AS and niEu-As) are small, 
raising the question as to whether they are statis- 
tically distinguishable from zero. Figure shows 
that the improvement in fit to the real data upon 
adding contemporary migration to the model is 
much larger than would be expected if there were 
no such migration, implying that the contempo- 
rary migration we infer is highly statistically sig- 
nificant. Omitting ancient migration {niAF-B) re- 
duced fit quality even more, indicating that the 
data also demand substantial ancient migration. 



D. Settling the New World 

To study the settlement of the Americas, we 
used the previously considered 22 CEU and 12 
CHB individuals plus an additional 22 individuals 
of Mexican descent sampled in Los Angeles (MXL). 
Data were processed as in our Out of Africa analy- 
sis, yielding 13,290 segregating SNPs from effec- 
tively 4.22 Mb of sequence. Fig [3]A_ shows the 
resulting AFS, while Fig |3p shows the marginal 
spectra. 

A model in which the CEU and CHB diverge 
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FIG. 3. Settlement of the New World 
analysis. As in Fig [2] A) is the data, B) is a 
schematic of the model we fit, C) compares the 
data and model AFS, and D) compares LD. E) 
The fit of our model to the real data is not 
atypical of fits to simulated data. F) The 
improvement in real data fit upon including 
CHB-MXL migration (red line) is very typical of 
the improvement in fits to simulated data without 
CHB-MXL migration. Thus we have no evidence 
for CHB-MXL migration after divergence. 



from an equilibrium population did not reproduce 
the AFS well (supporting information). Interest- 
ingly, a model allowing a prior size change in the 
ancestral population better fit the AFS but very 



poorly fit the observed LD decay (supporting in- 
formation). Thus, reproducing the AFS does not 
guarantee reproduction of LD, at least given a his- 
torically unrealistic model. To develop a more 
realistic model, we endeavored to include the ef- 
fects of Eurasian divergence from and migration 
with the African population. Computational lim- 
its precluded us from considering all 4 populations 
simultaneously, so we dropped the African popu- 
lation from the simulation upon MXL divergence 
(Fig§3). 

Table [njrecords the maximum-likelihood param- 
eter values inferred for this model. Because this fit 
did not include African data, we could not reli- 
ably infer demographic parameters involving the 
African population. Thus, for this point estimate 
we fixed the Africa-related parameters Nj^, N^p, 
Nb, mAF-B, ruAF-EU, mAF~AS, Taf and Tb 
to their maximum- likelihood values from Table [U 
Fig[3p compares the model and data spectra. The 
residuals show little correlation, with the possible 
exception that the model may underestimate the 
number of high-frequency segregating alleles. 

Parameter confidence intervals are reported in 
Table |llj To account for our uncertainty in those 
parameters derived from the Out of Africa fit, for 
each conventional bootstrap fit we used a set of 
Africa-related parameters randomly chosen from 
the sets yielded by our Out of Africa conven- 
tional bootstrap. For the parametric bootstrap, 
we used the maximum-likelihood point estimates. 
Again, we see that the conventional bootstrap 
confidence intervals are comparable to, although 
slightly wider than, the parametric bootstrap in- 
tervals. Several parameters in this analysis have 
direct correspondence with our Out of Africa anal- 
ysis. Of particular note, the confidence intervals 
for the CEU-CHB divergence time Teu-as over- 
lap. 

In assessing goodness of fit, Fig|3p shows that 
this model does indeed reproduce the observed pat- 
tern of LD decay. Unlike in our Out of Africa 
analysis, however, here the LD decay was used to 
choose the form of the model (although not its pa- 
rameter values), so this is not a completely inde- 
pendent assessment of fit. Of our 100 parametric 
bootstrap fits, 13 yielded a worse likelihood than 
the real fit (Fig [3^), for a p-value of 0.13. Ap- 
plying Pearson's test, we find that 23 of 100 
bootstrap fits yield a higher (worse) than the 
fit to the real data, for a p-value of ~ 0.23, simi- 
lar to that of the likelihood analysis. Comparing 
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TABLE II. Settlement of New World inferred parameters 







conventional 


parametric bootstrap 




maximum 


bootstrap 95% 


bias-corrected 95% 


parameter" 


likelihood 


confidence interval 


confidence interval 


Neuo 


1,500 


700 - 2,100 


900 - 2,200 


TEU (%) 


0.23 


0.08 - 0.45 


0.16 - 0.34 


Naso 


590 


320 - 800 


410 - 790 


TAB (%) 


0.37 


0.16 - 0.60 


0.24 - 0.51 


Nmxo 


800 


160 - 1,800 


140 - 1,600 


TMX (%) 


0.50 


0.14 - 1.17 


0.41 - 0.98 


rriEv-AS (xlO"'') 


13.5 


7.5 - 32.2 


9.9 - 20.8 


Teu-as (kya) 


26.4 


18.1 - 43.1 


21.7 - 30.7 


Tmx (kya) 


21.6 


16.3 - 26.9 


18.6 - 24.7 


/MX (%) 


48 


42 - 60 


41 - 55 



"Sec Fig[3j3 for model schematic. Growth rates r and migration rates m are per generation, /mx is the average European 
admixture proportion of the Mexican- Americans sampled. 



distributions of rare alleles, our model typically re- 
produces the observed distribution well, although 
it may be somewhat overestimating the propor- 
tion of alleles that are rare or absent in the CHB 
population (supporting information). In sum, our 
model appears to be a reasonable explanation of 
this data, somewhat better than in our Out of 
Africa analysis. 

An essential feature of the Mexican-American 
individuals considered here is that they are typ- 
ically admixed from Native American and Euro- 
pean ancestors. The ~50% average European ad- 
mixture proportion we inferred for the MXL pop- 
ulation is consistent with previous estimates for 
Los Angeles Latinos !52]. We have no direct data 
from the Native American populations ancestral to 
MXL, but our model does account for their diver- 
gence from East Asia. A model neglecting this di- 
vergence (by setting Tmx to zero) fit the data sub- 
stantially worse and yields an unrealistically high 
average European admixture proportion into MXL 
of 0.68. 

Not only are Mexican- American individuals ad- 
mixed, their admixture proportions also vary, and 
this subtlety is not directly accounted for in our 
analysis. To assess its effect on our results, we first 
roughly estimated the ancestry proportion of each 
individual, using essentially a maximum-likelihood 
version |18j of the algorithm used in structure (53j 
(supporting information) . (Methods based on "ad- 
mixture LD" , which identify breakpoints between 



regions of Native American and European ances- 
try, may be more powerful |54j . However, the strat- 
egy used by the EGP of sequencing widely spaced 
genes will resolve few of these breakpoints, limiting 
the applicability of these methods.) We then per- 
formed additional parametric bootstrap analyses, 
using simulations with a distribution of individual 
ancestry chosen to mimic that seen in the data and, 
to further test the method, with an extremely wide 
distribution. These simulations showed that vari- 
ation in individual ancestry does not bias our pa- 
rameter inferences (supporting information). Re- 
markably, it does not even change our statistical 
power. This is evidenced by the fact that these 
bootstrap simulations yielded confidence intervals 
identical to our original simulations without vari- 
ation in ancestry proportion (supporting informa- 
tion). Nevertheless, future studies may profit by 
incorporating individual ancestry information |18j , 
perhaps inferred from admixture LD. 

Finally, our model allowed us to assess the role 
recurrent migration from Asia played in the settle- 
ment of the New World [2] . When we added CHB- 
MXL migration to our model, we found that the 
maximum likelihood migration rate was 1.7 x 10~^ 
per generation. As shown in Fig [Sji^, the result- 
ing improvement in likelihood is typical (p-value 
«0.45) of fits including CHB-MXL migration to 
data simulated without it. Our data and analysis 
thus yielded no evidence of recurrent migration in 
the settlement of the New World. Note, however. 
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that this simple test does not necessarily rule out 
more complex scenarios, in which migration may 
vary over time. 



E. Nonsynomymous polymorphism 

Polymorphisms that change protein amino acid 
sequence are of medical interest because they are 
particularly likely to affect gene function [55] , 
Correspondingly, they are often subject to nat- 
ural selection. Diffusion approaches are partic- 
ularly useful for studying such nonsynonymous 
polymorphism, because they easily incorporate se- 
lection. Although the diffusion approximation as- 
sumes that sites are unlinked, nonsynonymous seg- 
regating sites are rare enough that this is often a 
reasonable approximation |49j . 

As an illustration, we used our Out of Africa 
demographic model to predict the distribution of 
such variation between continental populations. 
To do so, we must specify a distribution for the 
selective effects of nonsynonymous mutations that 
enter the population. For this we adopted a neg- 
ative gamma distribution whose parameters were 
recently inferred [IH]. The resulting distribution 
of segregating variation is shown in Figure |4|\_. (To 
ease comparison, we have assumed the same scaled 
mutation rate as in the neutral case of Fig|2p.) As 
expected, selection sharply reduces the amount of 
segregating polymorphism. Figure |4j3 shows the 
proportion of variants within various classes. Also 
as expected, selection shifts nonsynonymous vari- 
ation toward lower frequencies, raising the propor- 
tion of singletons and lowering the proportion at 
frequency greater than 10%. Less obviously, it also 
reduces the proportion of variation that is shared 
between populations. In the neutral case, 43% of 
polymorphism is predicted to be present in more 
than one population, while in the selected case only 
35% is. Thus genetic inferences from coding poly- 
morphism may be less transferable between popu- 
lations than might be expected from neutral pat- 
terns of allele sharing. 

In the data considered here, there are about 400 
nonsynomymous polymorphisms segregating in the 
three populations considered. This is too few for a 
detailed goodness-of-fit test of our predicted distri- 
bution. (Although see supporting information for 
a direct AFS comparison.) Nevertheless, we ob- 
serve that our predictions shown in Figure [4|3 all 
lie within the bootstrap 95% confidence intervals 
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FIG. 4. Distribution of nonsynonymous 
polymorphism. We simulated our 
maximum-likelihood Out of Africa demographic 
model with a distribution of selective effects 
previously inferred for nonsynonymous 
polymorphism [25^. A) To enable direct 
comparison with the neutral AFS (Fig[2p), the 
scaled mutation rate 9 was set identically, as is 
the color scale. As expected, selection 
dramatically reduces the amount of segregating 
polymorphism. B) Shown are the proportions of 
variation found in various frequency classes. As 
expected, nonsynonymous variants typically have 
lower frequency. They also less likely to be shared 
between populations. Data error bars indicate 
95% bootstrap confidence intervals. 



from the data. 



IV. DISCUSSION 

Our diffusion approximation to the joint allele 
frequency spectrum is a powerful tool for popu- 
lation genetic inference. Although the diffusion 
approximation neglects linkage between sites, our 
method's computational efficiency allows us to use 
extensive bootstrap simulations to account for the 
effects of linkage. (Let us reiterate that linkage 
does not affect the expected site-frequency spec- 
trum of neutral sites, so our diffusion-based ap- 
proach is estimating the same AFS that coalescent 
simulations are estimating, but in a small fraction 
of the time). We applied our method to human 
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expansion out of Africa and settlement of the New 
World, using public resequencing data from the 
Environment Genome Project. The flexibility of 
the diffusion approach also allowed us to consider 
the distribution of non-neutral variation, which is 
difficult to address with other approaches. Al- 
though no model can capture in detail the com- 
plete history of any population, the models pre- 
sented here help refine our understanding of human 
expansion across the globe. 

Our demographic results are broadly consistent 
with previous analyses of human populations. In 
particular, single-population analyses have also in- 
ferred African population growth and European 
and Asian bottlenecks [H E]. Also, the mi- 
gration rates we infer are similar to those inferred 
by Schaffner et al. [S] but somewhat smaller than 
those of Cox et al. |TS]. On the other hand, Keinan 
et al. [11 inferred no significant migration be- 
tween CEU and CHB. Finally, our estimate of a 
New World founding effective population size in 
the hundreds is compatible other inferences [T^. 

Perhaps our most interesting demographic re- 
sults are the inferred divergence times. Other stud- 
ies im ITS] have estimated divergence times be- 
tween Europeans and East Asians similar to the 
«23 kya we infer. Interestingly, archeological evi- 
dence places humans in Europe much earlier («40 
kya) [1]. Our inferred divergence time of «22 kya 
between East Asians and Mexican-Americans is 
somewhat older than the oldest well-accepted New 
World archeological evidence [2]. The divergence 
we infer may reflect the settlement of Beringia, 
rather than the expansion into the New World 
proper [14|. Finally, the divergence time of «140 
kya we infer between African and Eurasian popu- 
lations is consistent with archeological evidence for 
modern humans in the Middle East wlOO kya pT;, 
but it is much older than other inferences of ~50 
kya divergence from mitochondrial DNA [1]. This 
discrepancy may be explained by our inclusion of 
migration in the model. Migration preserves cor- 
relation between population allele frequencies, so 
an observed correlation across the genome can be 
explained by either recent divergence without mi- 
gration or ancient divergence with migration. In 
fact, the African-Eurasian migration rate we infer 
of «25 X 10~^ per generation is comparable to the 
«100 X 10~^ inferred from census records between 
modern continental Europe and Britain [551 . 

One difficulty in interpreting our divergence 
times is that the sampled populations may not best 



represent those in which historically important 
divergences occurred. For example, the Yoruba 
are a West African population, so the divergence 
time we infer between Yoruba and Eurasian an- 
cestral populations may correspond to divergence 
within Africa itself. Future studies of more popu- 
lations [57l HHJ [59] will help alleviate this difficulty. 

Another difficulty is that the genie loci we study 
here may not be ideal for demographic inference. 
Although we consider only noncoding sequence in 
fitting our historical model, selection on regula- 
tory or linked coding sites may skew the AFS [60, . 
In fact, the EGP data have been shown to differ 
in some ways (e.g. Tajima's D) from intergenic 
regions [53] • Nevertheless, we use the EGP data 
because it is currently the largest public resource 
of noncoding human genetic variation, and we fit 
a neutral model because disentangling the small 
expected effects of selection on these sites from de- 
mographic effects will require additional data. The 
rapidly declining cost of sequencing will give future 
studies access to many more loci that are likely to 
be less influenced by selection. Importantly, the 
computational burden of our method is indepen- 
dent of the amount of sequence used to construct 
the AFS. Additional loci will also increase power to 
discriminate between models and incorporate more 
detail. 

The AFS encodes substantial demographic infor- 
mation. It is has been shown, however, that an iso- 
lated population's AFS does not uniquely and un- 
ambiguously identify its demographic history pT] : 
we expect a similar result to hold for multiple inter- 
acting populations. Moreover, the AFS does not 
capture all the information in the data. As illus- 
trated by the alternative New World models we 
considered, patterns of linkage disequilibrium en- 
code additional information. Future studies may 
profit from coupling our efficient AFS simulation 
with methods that address other aspects of the 
data. 

We have developed a powerful diffusion-based 
method for demographic inference from the joint 
allele frequency spectrum. We applied our method 
to human expansion out of African and the settle- 
ment of the New World, developing models of hu- 
man history that refine our knowledge and raise in- 
triguing questions. We also applied our method to 
predict the distribution of nonsynonymous varia- 
tion across populations, and this prediction is con- 
sistent with the available data. Our methods and 
the models inferred from it offer a foundation for 
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studying the history and evolution of both our own 
species and others. 
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